dingo: a Python package for metabolic flux sampling

Abstract   We present dingo, a Python package that supports a variety of methods to sample from the flux space of metabolic models, based on state-of-the-art random walks and rounding methods. For uniform sampling, dingo’s sampling methods provide significant speed-ups and outperform existing software. Indicatively, dingo can sample from the flux space of the largest metabolic model up to now (Recon3D) in less than a day using a personal computer, under several statistical guarantees; this computation is out of reach for other similar software. In addition, dingo supports common analysis methods, such as flux balance analysis and flux variability analysis, and visualization components. dingo contributes to the arsenal of tools in metabolic modelling by enabling flux sampling in high dimensions (in the order of thousands). Availability and implementation The dingo Python library is available in GitHub at https://github.com/GeomScale/dingo and the data underlying this article are available in https://doi.org/10.5281/zenodo.10423335.

Our contribution, a Python library called dingo, supports a suite of sampling approaches (Table 1) initially implemented in C++ as parts of the volesti library [5].The optimized implementation of Billiard Walk [15] (BW) supported in dingo employs boundary reflections to outperform the mixing time of Hit-and-Run algorithms in practice [4,6].This implementation of BW achieves a reduced cost per step compared to previous BW implementations [15,9,37] and it is faster by a factor of n for convex polytopes.
Additionally, dingo provides Multiphase Monte Carlo Sampling (MMCS) [6], and isotropic rounding that, to our knowledge, are not available at any other software package.In brief, MMCS constructs a sequence of polytopes (phases) such that sampling is accelerated in each phase.In each phase, MMCS samples from the corresponding polytope and then applies to it a linear transformation that maps the sample to an isotropic position, to obtain the polytope of the next phase.All the samples are mapped back to the initial polytope.Thus, MMCS obtains both a uniform sample and a polytope in near isotropic position upon termination.MMCS is the first method that unifies rounding and sampling in one pass and it is typically applied on a nonrounded polytope.Therefore, the rounding preprocess is not necessary for MMCS.A parallel implementation of MMCS is also supported.
To guarantee the quality of the samples dingo computes the total Effective Sample Size (ESS) and the overall Potential Scale Reduction Factor (PSRF) during sampling and stops when both achieve a certain threshold given by the user.

Performance comparisons
To illustrate the efficiency of dingo we compare its implementations against the best existing implementation of CHRR 2 .HOPS library is reported to be six times faster for sampling than cobra [18] while both implementing CHRR.However, since PolyRound provides the most efficient implementation for the rounding step in CHRR, we compare against the combination of PolyRound and HOPS.In particular, we compute the rounded polytope with PolyRound and we sample from the rounded polytope using the CHR impementation in hopsy, the python API of HOPS (Figure 1 top box and Figure 2 red line) 3 .
We test CHRR as described above against the execution times with dingo's implementations.First, we sample from the rounded polytopes obtain by PolyRound using the dingo's BW, we call it BWR (Figure 1 intermediate box and Figure 2 green line).Finally, we sample from the nonrounded polytope using the MMCS implementation of dingo (Figure 1 bottom box and Figure 2 blue line).To obtain the non-rounded polytope corresponding to a metabolic network, we use the routines of PolyRound as it also provides efficient facet redundancy removal implementations (Figure 2 yellow line).This step was common in all three sampling methods we compared.For hopsy we use a thinning of 100d for all models but Recon3D where we use 200d as suggested by the authors [18].NA means that after 10 days, hopsy was not able to converge and the process stopped at ESS=430, PSRF=1.024.
We perform our benchmarks by requesting ESS = 1000 and PSRF ≤ 1.1 for dingo.We pick these values for illustration reasons as we analyze the progress of both ESS and PSRF during the run of the sampling methods later (Figure 3).Since hopsy does not have an option to request a certain value of ESS we count its run-time until the generated sample achieves the same targets for both ESS and PSRF as in dingo.Last, in hopsy we use 5 chains following the same process as in [18].Our benchmark dataset constitutes of five models from the BiGG Models knowledgebase that correspond to polytopes of dimension ranging from 122 to 5335 including Recon2 (version 2.2) from the BioModels database and Recon3D from Virtual Metabolic Human database.The results are illustrated in Table 2 and in Figure 2.
Notably, both dingo's implementations of BWR and MMCS are faster than CHRR.BWR is is 11-33 times faster than CHRR while MMCS is 2-13 times faster depending on the dimension.BWR is faster than MMCS for the benchmarks in the Table 2.However, the gap between the  2. The runtime of PolyRound corresponds to the preprocess step of obtaining and rounding the polytope that corresponds to each metabolic model.
MMCS and BWR runtimes decreases as the dimension of the polytope increases denoting that MMCS will eventually outperform BWR for higher dimensional polytopes that correspond to bigger models.This result can be explained by the analysis of algorithms that compute the largest inscribed ellipsoid of a polytope [23,36] where the performed operations increase faster as the dimension increases than the performed operations in MMCS [6].Moreover, MMCS achieves a better rounding than PolyRound in its last phases, and thus, BW mixing rate increases.The need for enabling flux sampling in even higher dimensions in the analysis of metabolic models is certain especially as community models emerge; a community model consists of a set of metabolic networks and there are several approaches to build such a model [14].Last but not least, dingo can sample from recon3D and obtain an ESS = 1000 in less than 16 hours, while hopsy could not reach that ESS in 10 days.For details see Table 2.
For the reproducibility of our benchmark results note that we have used dingo version 0.1.0 4, PolyRound version 0.2.0 and hopsy version v0.2.0 5 .Also the scripts that we have used along with their outcomes (processed polytopes and samples) are publicly available under the following Zenodo repository [7]: https://zenodo.org/records/10423335.Figure 3 illustrates the progress of ESS and PSRF in all the three sampling methods we compare for the model iSDY 1059.To ensure a meaningful comparison, we report ESS and PSRF values at the points where the MMCS transitions to the next phase based on the proportion of points sampled.Specifically, MMCS undergoes four phases, transitioning at 29%, 58%, and 87% of the total generated points, respectively.The reported ESS and PSRF values for the other two methods are synchronized with MMCS's progression.Notably, both BWR and CHRR exhibit similar fluctuations in ESS and PSRF, yet BWR's superior convergence rate results in faster runtimes.In contrast, MMCS demonstrates less efficiency in the initial two phases compared to the latter two.This inefficiency is attributed to the time spent rounding the polytope during the 4 commit hash #03d6e65a1278753bd879ea24f 6512e7e8b95af 93 5 commit hash #4ad62ade6aa841d37f 35d24e68043b3581358ce6 3 Illustrations and statistical tools dingo provides some illustrations and statistical tools to help the user to make inference on a metabolic model.In particular, it provides probability density estimation for a flux distribution, given the marginal samples, and copula estimation and plotting.A copula is a bivariate probability distribution for which the marginal probability distribution of each variable is uniform; it can capture the dependency between two random variables (e.g., two reaction fluxes).
Recently, the virus biomass objective function (VBOF) of SARS-CoV-2 was generated and integrated with a genome-scale metabolic model of human alveolar macrophages by Renz et al. [31].In their study, the authors performed Flux Balance Analysis (FBA) combined with reactions knock-out to reveal that the knock-out of Guanlyate Kinase 1 (GK1) decreased the growth of the virus to zero, while not affecting the one of the host.Among its several approaches [34], flux sampling can be used to increase the confidence level of FBA predictions [29].As a demonstration, we performed flux sampling with dingo using the integrated human-virus model.
The flux space of the model was first sampled in an unbiased way, i.e., without requiring any optimization function to be optimized.Sampling was also performed after maximizing the host's biomass function and finally, after maximizing the VBOF.In each of these cases, a flux value distribution for each reaction of the model was obtained, instead of a single value as in the FBA case.The plots in Figure 4 illustrate the distributions of the flux values of the Tyramine Sulfotransferase (TYMSULT) and the Guanylate Kinase 1 (GK1) reactions after maximizing for the host's biomass function (Figure 4(a)) and for the VBOF (Figure 4(b)) while the vertical lines correspond to the flux values computed by the FBA method.In the first case, FBA returned a zero value for both cases while the distributions were almost identical (Figure 4(a)), implying that the different requirements of host and virus are not related with the TYMSULT reaction.That is the case for the vast majority of the reactions of the model.On the other hand, FBA highlighted that the flux of GK1 increases when VBOF is maximized.That is further highlighted by the flux densities of GK1 reaction where both the mean and the variance of the distribution were increased when VBOF was maximized (Figure 4(b)).
To capture the dependency between the virus' biomass and the flux of the GK1 reaction, a copula of the GK1 and the human biomass fluxes' distributions was computed maximizing first for VBOF.As shown in Figure 5, their negative dependency further support the findings of Renz et al.. Interestingly, the FBA flux values for GK1, were rather lower than it is more likely for them to be, both if it is the human biomass that has been maximized or VBOF.In conclusion, flux sampling has the potential to be useful in drug targeting studies [30,34].

Future work
Constraint-based approaches for the analysis of metabolic models may suffer from thermodynamically infeasible loops [33].As Beard et al. first observed [2], metabolic networks can be conceptualized as analogous to current networks, suggesting that Kirchhoff's laws should similarly apply to them.According to Kirchoff, the sum of currents flowing into and out of a junction in a network of conductors is zero (first law, the current law ) while the total potential difference (voltage) around any closed loop, when considering its direction, must be equal to zero too (second law, the voltage law ).Mass conservation applied in metabolic modeling ensures the first law.Yet, extra constraints, discussed in [2], need to be applied so the second law is ensured too to avoid flux solutions with active closed loops [33].Based on previous work [33], dingo will support checks for the thermodynamic feasibility of a sample while sampling approaches such as the recently published by Saa et al. [32] will be implemented too.Moreover, certain random walks provided by dingo -as Billiard walk-can be employed to sample from non-convex bodies [15].
While loopless sampling is strongly related to sampling in non-convex polytopes [32], another common question that rises in so-far implementations is whether it is the uniform distribution the most suitable for investigating a cell's metabolism in an unbiased way.In a previous study, De Martino et al. showed that the maximum entropy distribution [10] could be used instead coupled with experimental data for growth.dingo will support sampling using the maximum entropy distribution in the near future.
Last, community modelling in microbiome studies is growing.Approaches such as those described in [11] and [3] and implementations such as MICOM [12], COMETS [13] and BacArena [1] have combined community modelling with FBA or dynamic FBA.In a future version of dingo it is our intention to support community flux sampling using several community modelling approaches.

Figure 1 :
Figure 1: Workflow followed for comparing dingo with state-of-the-art sampling approaches.In blue boxes the approaches using the dingo Python library.CHR: Coordinate Hit-and-Run; CHRR: CHR Rounded; BW: Billiard Walk; BWR: BW Rounded; MMCS: Multiphase Monte Carlo Sampling.

Figure 2 :
Figure 2: Runtime comparison of the three sampling methods in Table2.The runtime of PolyRound corresponds to the preprocess step of obtaining and rounding the polytope that corresponds to each metabolic model.

Figure 3 :
Figure 3: The progress of Effective Sample Size (ESS) and Potential Scale Reduction Factor (PSRF) for the three sampling methods for the model iSDY 1059.The percentages at the x-axis correspond to proportions of the total number of samples when MMCS transits from a phase to the next one.We report the values for all the three models on the same proportion of points to obtain a meaningful comparison.

Figure 4 :
Figure 4: Distributions of the integrated human-virus model reactions' flux values after maximizing for the host's biomass function (red) and for the VBOF (blue).(a) In case of the TYMSULT reaction, the distribution is the same for both cases (b) Contrary, the flux value distribution of GK1 shifts when maximizing for VBOF.

Figure 5 :
Figure 5: Copula of the biomass function and the GK1 reaction flux values' distributions, after maximizing the integrated model for VBOF.A positive dependency between the two reactions is shown as the flux of the virus biomass is low most probably, the flux of GK1 is also low and the same applies in case of a high flux value.

Table 2 :
Time (in seconds) performance for sampling a metabolic model with hopsy and dingo packages.Rounding for CHRR and BWR is performed by PolyRound.The dimension d is after preprocessing.